In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
import os

from platform import python_version
print('python',python_version())

import matplotlib
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 15}

matplotlib.rc('font', **font)

%matplotlib inline
plt.style.use('dark_background')
python 3.5.2
  • Objective: The behavioral method aims to identify the home and workplace locations of individuals based on their observed longitudinal presences.
    • Note that there is no one-to-one correspondence between activities and user locations: people perform a particular activity at different user locations, and perform several activities at the same user location.
  • Raw data: each user has a sequence of coordinates with timestamps across the observation period
    • Each record in the raw data is a "presence": an assumed activity type or trip purpose of a person
  • Unit of interest: person-location
  • Feature:

    • Normalized Hourly Presence (NHP) - used as a set of hourly features
      • characterizes the temporal presence patterns of individuals at a user location
      • acts as a probability distribution vector capturing the likelihood that the person appears at that location at a given time of day
      • is computed as the average number of presences in each hour of a weekday (or a weekend day) for a particular person-location
      • is computed per person per hour per location
      • weekdays and weekends are aggregated separately because we expect different schedule patterns (e.g. people have more spare time on weekends)
      • each person-location is characterized by 48 Normalized Hourly Presences (NHPs)
        • 24 hours for weekdays
        • 24 hours for weekends
  • Target input:

    • After preprocessing, the raw data should be transformed into this kind of matrix format:
                wdhr0 wdhr1 wdhr2 ... wdhr23 wehr0 wehr1 wehr2 ... wehr23
        id1loc1
        id1loc2
        id2loc1
        .
        .
        .
        idnlocn
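
The target matrix above can be sketched on toy data with pandas.pivot_table (the ids, locations, and counts below are made up; the column naming mirrors the `wday_hr` labels built later in this notebook):

```python
import pandas as pd

# Toy presence counts: one row per (person, location, day-type/hour slot).
toy = pd.DataFrame({
    'id':      [1, 1, 1, 2],
    'latlong': ['a', 'a', 'b', 'a'],
    'wday_hr': ['wd08_09', 'wd09_10', 'we20_21', 'wd08_09'],
    'count':   [3, 2, 5, 1],
})

# Pivot to one row per person-location, one column per day-type/hour slot;
# slots with no presence become 0.
mat = pd.pivot_table(toy, values='count', index=['id', 'latlong'],
                     columns='wday_hr').fillna(0)
print(mat)
```

Each row of `mat` is one person-location; with all 48 weekday/weekend hours present in the data, it has exactly the 48 columns described above.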

1. Preprocessing

a. Load data

In [4]:
data = pd.read_csv('data.csv',header=None)
data.columns = ['ID','make','device','start_time','end_time','lat','long','place','strength']
data['dummy'] = 1
data.shape
Out[4]:
(13257977, 10)

b. First 5 rows

In [5]:
data.iloc[:,1:].head()
Out[5]:
make device start_time end_time lat long place strength dummy
0 SIMCOM SIM7100C (M2M MODULE) 2019-07-19 00:16:00 2019-07-19 00:16:10 25.03 121.446 Outdoor Stationary High 1
1 SIMCOM SIM7100C (M2M MODULE) 2019-07-19 00:17:30 2019-07-19 00:17:40 25.03 121.446 Outdoor Stationary High 1
2 SIMCOM SIM7100C (M2M MODULE) 2019-07-19 00:19:00 2019-07-19 00:19:10 25.03 121.446 Outdoor Stationary High 1
3 SIMCOM SIM7100C (M2M MODULE) 2019-07-19 00:45:54 2019-07-19 00:46:04 25.03 121.447 Outdoor Stationary High 1
4 SIMCOM SIM7100C (M2M MODULE) 2019-07-19 00:47:24 2019-07-19 00:47:34 25.03 121.447 Outdoor Stationary High 1

c. Data types

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13257977 entries, 0 to 13257976
Data columns (total 10 columns):
ID            int64
make          object
device        object
start_time    object
end_time      object
lat           float64
long          float64
place         object
strength      object
dummy         int64
dtypes: float64(2), int64(2), object(6)
memory usage: 1011.5+ MB
In [7]:
print('Number of unique IDs: {}'.format(len(set(data.ID))))
print('Min start time: {}\tMax start time: {}'.format(min(data.start_time), max(data.start_time)))
print('Min end time: {}\tMax end time: {}'.format( min(data.end_time), max(data.end_time)))
Number of unique IDs: 200
Min start time: 2019-07-18 23:33:13	Max start time: 2019-08-01 23:59:59
Min end time: 2019-07-19 00:00:00	Max end time: 2019-08-01 23:59:59

1.1 Frequency counts of place-strength

In [8]:
cnt = data[['place','dummy']].groupby(['place']).sum().reset_index()
cnt.columns = ['place','count']
cnt['percentage'] = 100*(cnt['count']/len(data))
cnt
Out[8]:
place count percentage
0 Indoor 4605688 34.738995
1 Mobile 1548825 11.682212
2 Outdoor Stationary 7103313 53.577654
In [9]:
cnt = data[['strength','dummy']].groupby(['strength']).sum().reset_index()
cnt.columns = ['strength','count']
cnt['percentage'] = 100*(cnt['count']/len(data))
cnt
Out[9]:
strength count percentage
0 High 11380052 85.835509
1 Low 411649 3.104916
2 Medium 1466125 11.058437
In [10]:
cnt = data[['place','strength','dummy']].groupby(['place','strength']).sum().reset_index()
cnt.columns = ['place','strength','count']
cnt['percentage'] = 100*(cnt['count']/len(data))
cnt
Out[10]:
place strength count percentage
0 Indoor High 4036823 30.448258
1 Indoor Low 91090 0.687058
2 Indoor Medium 477775 3.603680
3 Mobile High 1122371 8.465628
4 Mobile Low 128320 0.967870
5 Mobile Medium 298134 2.248714
6 Outdoor Stationary High 6220858 46.921623
7 Outdoor Stationary Low 192239 1.449987
8 Outdoor Stationary Medium 690216 5.206043

1.2. Create start and end hour time

In [11]:
day = (5,10)    # slice indices for 'MM-DD' in 'YYYY-MM-DD HH:MM:SS'
hour = (11,13)  # slice indices for 'HH'

data['start_day'] = data['start_time'].map(lambda x: x[day[0]:day[1]])
data['start_hour'] = data['start_time'].map(lambda x: x[hour[0]:hour[1]])
data['end_hour'] = data['end_time'].map(lambda x: x[hour[0]:hour[1]])

data['hour'] = data.start_hour + '-' + data.end_hour
data_agg = data[['ID','start_day','hour','lat','long','dummy']].groupby(['ID','start_day','hour','lat','long']).sum().reset_index()
data_agg.columns = ['id','s_day','hour','lat','long','count']

print(sorted(set(data_agg.s_day)))
print(data.shape, data_agg.shape, len(set(data_agg.hour)),sorted(set(data_agg.hour)))
['07-18', '07-19', '07-20', '07-21', '07-22', '07-23', '07-24', '07-25', '07-26', '07-27', '07-28', '07-29', '07-30', '07-31', '08-01']
(13257977, 14) (771496, 6) 66 ['00-00', '00-01', '00-02', '00-06', '01-01', '01-02', '01-03', '02-02', '02-03', '03-03', '03-04', '04-04', '04-05', '04-08', '05-05', '05-06', '05-07', '06-06', '06-07', '07-07', '07-08', '07-09', '08-08', '08-09', '08-10', '08-11', '09-09', '09-10', '09-11', '10-10', '10-11', '10-12', '11-11', '11-12', '12-12', '12-13', '13-13', '13-14', '13-15', '14-14', '14-15', '14-16', '15-15', '15-16', '15-17', '16-16', '16-17', '16-18', '17-17', '17-18', '17-19', '18-18', '18-19', '19-19', '19-20', '20-20', '20-21', '20-22', '21-01', '21-21', '21-22', '22-22', '22-23', '23-00', '23-01', '23-23']
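
The string slicing above assumes every timestamp is exactly 'YYYY-MM-DD HH:MM:SS'. An equivalent, more format-tolerant sketch uses pd.to_datetime (the one-row sample here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'start_time': ['2019-07-19 00:16:00'],
                   'end_time':   ['2019-07-19 01:02:03']})

start = pd.to_datetime(df['start_time'])
end = pd.to_datetime(df['end_time'])

# Same derived fields as the slicing approach, but parsed, not sliced.
df['start_day'] = start.dt.strftime('%m-%d')
df['start_hour'] = start.dt.strftime('%H')
df['end_hour'] = end.dt.strftime('%H')
df['hour'] = df['start_hour'] + '-' + df['end_hour']
print(df[['start_day', 'hour']])
```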

1.3. Plot distribution of Presence count

In [12]:
df_count = data_agg[['id','s_day','count']].groupby(['id','s_day']).count().reset_index()

ax = df_count['count'].hist(bins=100)
ax.set_xlabel('Presence')
ax.set_ylabel('Frequency')
Out[12]:
Text(0, 0.5, 'Frequency')

1.4. Add weekday, weekend features in data_agg

In [13]:
#['07-18', '07-19', '07-20', '07-21', '07-22', '07-23', '07-24', '07-25', '07-26', '07-27', '07-28', '07-29', '07-30', '07-31', '08-01']
data_agg['day'] = data_agg.s_day.map(lambda x: 'we' if x in ['07-20','07-21','07-27','07-28'] else 'wd')
data_agg['latlong'] = data_agg.lat.map(str) + '-' + data_agg.long.map(str)
data_agg.sort_values(['id','s_day','hour'],ascending=[True,True,True], inplace=True)
data_agg.shape
Out[13]:
(771496, 8)

1.5. Process the hour feature

In [14]:
def process_hour(hour_str):
    
    #start and end hour is the same
    if hour_str[:2] == hour_str[-2:]:
        hr_end = int(hour_str[-2:]) + 1
        hr_end = "{0:02d}".format(hr_end)
        hour_str = hour_str[:2] +'-'+ hr_end
        
    # end hour that is '00' should be '24'
    if hour_str[-2:] == '00':
        hr_end = "{0:02d}".format(24)
        hour_str = hour_str[:2] +'-'+ hr_end
        
    # end hour less than start hour means the presence extends into the next day; cap the end at 24
    if (int(hour_str[-2:]) < int(hour_str[:2])): #end<start [23:01]
        hr_end = "{0:02d}".format(24)
        hour_str = hour_str[:2] +'-'+ hr_end
    
    return hour_str

def process_more_hours(df):
    # Expand each multi-hour presence into one row per one-hour slot,
    # carrying the same count to every slot. Collecting rows in a list and
    # building the frame once avoids the quadratic cost of row-by-row append.
    rows = []
    for _, r in df.iterrows():
        hr_start, hr_end = int(r['hr'][:2]), int(r['hr'][-2:])
        for hr in range(hr_start, hr_end):
            rows.append({'id': r['id'],
                         'day': r['day'],
                         'hr': "{0:02d}-{1:02d}".format(hr, hr + 1),
                         'latlong': r['latlong'],
                         'count': r['count']})
    return pd.DataFrame(rows, columns=['id', 'day', 'hr', 'latlong', 'count'])

data_agg['hr'] = data_agg.hour.map(lambda x: process_hour(x))

data_agg['hr_range'] = data_agg.hr.map(lambda x: int(x[-2:])-int(x[:2]))
df_1hr = data_agg[data_agg.hr_range <= 1][['id','day','hr','latlong','count']]
df_gt1hr = data_agg[data_agg.hr_range >= 2][['id','day','hr','latlong','count']].reset_index()
assert df_1hr.shape[0] + df_gt1hr.shape[0] == data_agg.shape[0]
df = process_more_hours(df_gt1hr)

data_agg = pd.concat([df_1hr,df], axis=0,sort=False)
assert df_1hr.shape[0] + df.shape[0] == data_agg.shape[0]

data_agg_sum = data_agg.groupby(['id', 'day', 'hr', 'latlong']).sum().reset_index()
data_agg_sum.sort_values(['id', 'day', 'hr', 'latlong'], inplace=True)
data_agg_sum.shape

print('Processed hours: {}\nTotal number of hours:{}'.format(sorted(set(data_agg_sum.hr)),len(set(data_agg_sum.hr))))
Processed hours: ['00-01', '01-02', '02-03', '03-04', '04-05', '05-06', '06-07', '07-08', '08-09', '09-10', '10-11', '11-12', '12-13', '13-14', '14-15', '15-16', '16-17', '17-18', '18-19', '19-20', '20-21', '21-22', '22-23', '23-24']
Total number of hours:24
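
To make the normalization rules concrete, here is a condensed restatement of process_hour (same logic as the cell above, with the three cases spelled out) together with the mappings it produces:

```python
def process_hour(hour_str):
    start, end = hour_str[:2], hour_str[-2:]
    if start == end:                      # zero-length span: extend by one hour
        return start + '-' + "{0:02d}".format(int(end) + 1)
    if end == '00':                       # midnight end hour becomes 24
        return start + '-24'
    if int(end) < int(start):             # overnight span: cap the end at 24
        return start + '-24'
    return hour_str                       # already a forward one-day range
```

For example, '13-13' becomes '13-14', '23-00' becomes '23-24', '21-01' becomes '21-24', and '08-10' is left unchanged.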
In [15]:
data_agg_sum['hr'] = data_agg_sum.hr.map(lambda x: x.replace('-','_'))
data_agg_sum['wday_hr'] = data_agg_sum.day + data_agg_sum.hr
data_agg_sum.head()
Out[15]:
id day hr latlong count wday_hr
0 5468973 wd 00_01 24.677-121.766 49 wd00_01
1 5468973 wd 00_01 24.994-121.48200000000001 1 wd00_01
2 5468973 wd 00_01 24.994-121.48299999999999 1 wd00_01
3 5468973 wd 00_01 24.995-121.48200000000001 15 wd00_01
4 5468973 wd 00_01 24.995-121.48299999999999 8 wd00_01

1.6. Pivot the table

a. Pivot to make the 48 weekday/weekend hour features

In [16]:
df_pivot = pd.pivot_table(data_agg_sum, values = 'count', index=['id','latlong'], columns = 'wday_hr').reset_index()
df_pivot  = df_pivot.fillna(0)
assert len(set(df_pivot.id.astype(str) + ':'+ df_pivot.latlong)) == df_pivot.shape[0]
df_pivot.shape
Out[16]:
(288610, 50)

b. Quick stats

In [17]:
df_pivot.describe()
Out[17]:
wday_hr id wd00_01 wd01_02 wd02_03 wd03_04 wd04_05 wd05_06 wd06_07 wd07_08 wd08_09 ... we14_15 we15_16 we16_17 we17_18 we18_19 we19_20 we20_21 we21_22 we22_23 we23_24
count 2.886100e+05 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 ... 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000
mean 5.309966e+07 1.429559 1.438446 1.450636 1.444004 1.443283 1.443872 1.414407 1.396542 1.359496 ... 0.534936 0.538117 0.538879 0.544527 0.549118 0.548553 0.549531 0.546540 0.549537 0.562597
std 1.910071e+07 32.639563 33.439788 33.099019 33.068012 33.347389 33.166925 32.305800 29.282213 27.800125 ... 12.349223 12.226646 12.681764 12.978861 13.447498 13.507039 14.113446 13.981886 14.500623 14.432229
min 5.468973e+06 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 5.880484e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 6.254789e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 6.276455e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 6.417020e+07 6505.000000 6518.000000 6522.000000 6522.000000 6514.000000 6540.000000 6545.000000 5923.000000 5619.000000 ... 1153.000000 1562.000000 1670.000000 1738.000000 1589.000000 1330.000000 2258.000000 2315.000000 2406.000000 2456.000000

8 rows × 49 columns

1.7. Normalize the hourly presences

  • NHP(i,h,l): NHP for individual i during hour h at location l
  • AHP(i,h,l): Absolute hourly presence for individual i during hour h at location l
  • h : ranges over [1,48], representing the 24 hours on weekdays and the 24 hours on weekends
  • L(i,h) : the unique set of user locations for individual i during hour h
  • NHP(i,h,l) = AHP(i,h,l) / Σ_{l' ∈ L(i,h)} AHP(i,h,l')
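
A tiny numeric sketch of this normalization, assuming one individual observed at two locations during a single hour (the counts are made up):

```python
import numpy as np

# AHP(i, h, l) for individual i during hour h at two user locations l1, l2.
ahp = np.array([30.0, 10.0])

# NHP(i, h, l) = AHP(i, h, l) / sum over l' in L(i, h) of AHP(i, h, l')
nhp = ahp / ahp.sum()
print(nhp)  # -> [0.75 0.25]
```

By construction the NHPs for a given individual and hour sum to 1 across that individual's locations.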
In [18]:
cols = ['wd00_01', 'wd01_02', 'wd02_03', 'wd03_04', 'wd04_05',
       'wd05_06', 'wd06_07', 'wd07_08', 'wd08_09', 'wd09_10', 'wd10_11',
       'wd11_12', 'wd12_13', 'wd13_14', 'wd14_15', 'wd15_16', 'wd16_17',
       'wd17_18', 'wd18_19', 'wd19_20', 'wd20_21', 'wd21_22', 'wd22_23',
       'wd23_24', 'we00_01', 'we01_02', 'we02_03', 'we03_04', 'we04_05',
       'we05_06', 'we06_07', 'we07_08', 'we08_09', 'we09_10', 'we10_11',
       'we11_12', 'we12_13', 'we13_14', 'we14_15', 'we15_16', 'we16_17',
       'we17_18', 'we18_19', 'we19_20', 'we20_21', 'we21_22', 'we22_23',
       'we23_24']

colswd = [col for col in cols if 'wd' in col ]
colswe = [col for col in cols if 'we' in col ]

1.7.1. Sum of presences for each individual and particular hour

In [19]:
df_pivot_sum_loc = df_pivot.groupby(['id']).sum().reset_index()
df_pivot_merge_sum = pd.merge(df_pivot, df_pivot_sum_loc, on='id', how='left')

# The merge gave each hourly column an '_x' (per-location) and '_y' (per-id sum)
# copy; dividing gives each location's normalized share for that hour.
orig_cols = [col for col in df_pivot_merge_sum.columns if col.endswith('_x')]
for col in orig_cols:
    df_pivot_merge_sum[col[:-2]] = df_pivot_merge_sum[col].map(float) / df_pivot_merge_sum[col[:-2] + '_y']

interested_cols = [col for col in df_pivot_merge_sum.columns if not col.endswith(('_x', '_y'))]
df_pivot_merge_sum = df_pivot_merge_sum[interested_cols].fillna(0.0)

df_pivot_merge_sum['IDloc'] = df_pivot_merge_sum['id'].map(str) + '_' + df_pivot_merge_sum.latlong
df_pivot_merge_sum.drop(['id','latlong'],axis=1, inplace=True)

a. Quick stats

In [20]:
df_pivot_merge_sum.describe()
Out[20]:
wday_hr wd00_01 wd01_02 wd02_03 wd03_04 wd04_05 wd05_06 wd06_07 wd07_08 wd08_09 wd09_10 ... we14_15 we15_16 we16_17 we17_18 we18_19 we19_20 we20_21 we21_22 we22_23 we23_24
count 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 ... 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000 288610.000000
mean 0.000690 0.000690 0.000690 0.000690 0.000690 0.000690 0.000693 0.000693 0.000693 0.000693 ... 0.000679 0.000683 0.000679 0.000676 0.000676 0.000676 0.000676 0.000676 0.000679 0.000679
std 0.014383 0.014553 0.014296 0.014329 0.014764 0.014638 0.014344 0.013042 0.012905 0.012321 ... 0.014924 0.014469 0.014554 0.014972 0.015684 0.015910 0.015715 0.015850 0.015988 0.016108
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 0.999579 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 48 columns

In [21]:
exclude = ['id','latlong']
for col in exclude:
    interested_cols.remove(col)

b. Final preprocessed data: show first 5 rows

In [22]:
df_pivot_merge_sum.head()
Out[22]:
wday_hr wd00_01 wd01_02 wd02_03 wd03_04 wd04_05 wd05_06 wd06_07 wd07_08 wd08_09 wd09_10 ... we15_16 we16_17 we17_18 we18_19 we19_20 we20_21 we21_22 we22_23 we23_24 IDloc
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.003257 0.000000 0.0 0.0 0.0 0.0 0.0 5468973_24.576999999999998-121.863
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.078176 0.073529 0.0 0.0 0.0 0.0 0.0 5468973_24.576999999999998-121.86399999999999
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.003257 0.007353 0.0 0.0 0.0 0.0 0.0 5468973_24.578000000000003-121.86399999999999
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.003257 0.062500 0.0 0.0 0.0 0.0 0.0 5468973_24.578000000000003-121.865
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 5468973_24.58-121.874

5 rows × 49 columns

PCA

Assumption: people's presence patterns at user locations with similar functions are similar across the sample population

In [23]:
from sklearn.decomposition import PCA
In [24]:
features = interested_cols
x = df_pivot_merge_sum.loc[:,features].values
print('data shape: ',np.shape(x))

pca = PCA(random_state = 8888) #(n_components=3)

principalComponents = pca.fit_transform(x)

#principalDf = pd.DataFrame(data = principalComponents
#             , columns = ['prinComp1', 'prinComp2','prinComp3'])

# for new data, just use:
# pca_x_new = pca.transform(x_new)  
data shape:  (288610, 48)

a. 48 PCs

In [25]:
principalDf = pd.DataFrame(data = principalComponents)
principalDf.head()
Out[25]:
0 1 2 3 4 5 6 7 8 9 ... 38 39 40 41 42 43 44 45 46 47
0 -0.004221 0.000683 -0.000492 0.000563 -0.000241 -0.000041 -0.000728 0.000262 -0.000712 -0.000707 ... -0.000038 0.000218 0.000115 -0.000137 0.000357 -0.000017 -0.000074 0.000042 0.000006 -3.380610e-05
1 0.018586 0.033541 -0.004263 0.017564 -0.004823 -0.017106 -0.031177 0.014076 -0.039815 -0.012848 ... -0.000371 0.000867 0.005727 -0.003064 0.002601 0.001647 -0.001946 0.002515 0.000129 -1.200295e-03
2 -0.003062 0.002178 -0.000742 0.001780 -0.000307 -0.001331 -0.002169 0.001310 -0.002966 -0.000278 ... 0.000052 -0.000200 0.000401 -0.000110 -0.000217 0.000184 -0.000073 0.000190 0.000012 -6.944714e-05
3 0.005625 0.013388 -0.002618 0.010902 -0.000801 -0.011008 -0.012978 0.009166 -0.019874 0.002943 ... 0.000722 -0.003328 0.002549 0.000093 -0.004522 0.001692 -0.000070 0.001299 0.000051 -3.367549e-04
4 -0.004281 0.000655 -0.000661 -0.000079 -0.000392 0.000708 -0.000451 -0.000507 0.000535 0.000202 ... -0.000093 0.000201 0.000226 -0.000026 0.000048 -0.000049 0.000065 0.000098 -0.000027 -3.874482e-08

5 rows × 48 columns

2.1. Eigenvalues and eigenvectors

  • eigenvector = eigenlocation: represents a typical presence pattern, explaining a portion of the variance in behavioral presences.
In [26]:
#eigenvectors
print('eigenvectors shape: ', np.shape(pca.components_))
eigenvectors shape:  (48, 48)
In [27]:
# eigenvalues
print('eigenvalues: ', pca.explained_variance_)
print('explained_variance of each eigenvalue: ',pca.explained_variance_ratio_)
eigenvalues:  [6.13353015e-03 5.57985914e-04 4.60076382e-04 3.87160151e-04
 2.74574464e-04 2.12416297e-04 1.71916896e-04 1.50095561e-04
 1.30156407e-04 1.11169552e-04 1.00941950e-04 8.71465726e-05
 8.44353025e-05 7.61701244e-05 6.95239106e-05 6.09960886e-05
 5.72023188e-05 5.53037810e-05 5.27568361e-05 4.75590559e-05
 4.21152302e-05 4.18428811e-05 3.93829201e-05 3.82953833e-05
 3.57124107e-05 3.41775771e-05 3.27049784e-05 3.06263163e-05
 2.93531174e-05 2.78686357e-05 2.62630167e-05 2.57494461e-05
 2.43975045e-05 2.35407433e-05 2.31770136e-05 2.16560186e-05
 2.11908082e-05 1.90706502e-05 1.84086884e-05 1.71535071e-05
 1.55257727e-05 1.45572761e-05 1.41292757e-05 1.32991898e-05
 1.19318524e-05 1.16369019e-05 1.06444272e-05 9.99445818e-06]
explained_variance of each eigenvalue:  [0.61609317 0.05604787 0.04621318 0.03888898 0.02758011 0.02133653
 0.01726849 0.01507661 0.01307379 0.01116662 0.01013929 0.00875359
 0.00848125 0.00765104 0.00698345 0.00612686 0.00574579 0.00555509
 0.00529925 0.00477715 0.00423034 0.00420298 0.00395589 0.00384665
 0.0035872  0.00343303 0.00328511 0.00307631 0.00294843 0.00279931
 0.00263803 0.00258645 0.00245065 0.00236459 0.00232806 0.00217528
 0.00212855 0.00191558 0.00184909 0.00172301 0.00155951 0.00146223
 0.00141924 0.00133586 0.00119852 0.00116889 0.0010692  0.00100391]

2.2. Loadings

Looking under the hood of PCA on GitHub reveals that PCA.fit() is a thin wrapper around PCA._fit(), which returns the PCA object itself to allow method chaining. The fit step performs an SVD on the centered data matrix and sets pca.components_ to the first n_components right singular vectors (the rows of Vᵀ). The rows of this matrix are the loadings plotted below.
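
A small sketch on random data (not the presence data) confirming that pca.components_ matches the right singular vectors of the centered data matrix, up to a sign flip per component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 6)

pca = PCA(n_components=3).fit(X)

# SVD of the centered data: the rows of Vt are the right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each fitted component equals a right singular vector up to overall sign.
match = np.allclose(np.abs(pca.components_), np.abs(Vt[:3]))
print(match)
```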

In [28]:
loadings = pca.components_
In [29]:
plt.figure(figsize=(10,5))
plt.plot(loadings[0],color='pink',label='First')
plt.plot(loadings[1],color='lightblue',label='Second')
plt.plot(loadings[2],color='red',label='Third')
plt.plot(loadings[3],color='blue',label='Fourth')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Loadings')
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x7fbc8f0ad9b0>
In [30]:
plt.figure(figsize=(10,5))
plt.plot(loadings[4],color='pink',label='Fifth')
plt.plot(loadings[5],color='lightblue',label='Sixth')
plt.plot(loadings[6],color='red',label='Seventh')
plt.plot(loadings[7],color='blue',label='Eighth')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Loadings')
plt.legend()
Out[30]:
<matplotlib.legend.Legend at 0x7fbc8f093fd0>

From the seventh component onwards, the patterns look increasingly like noise...

In [43]:
plt.figure(figsize=(10,5))
plt.plot(loadings[6],color='red',label='Seventh')
plt.plot(loadings[7],color='blue',label='Eighth')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Loadings')
plt.legend()
Out[43]:
<matplotlib.legend.Legend at 0x7fbb77e287f0>

2.2.1. Loadings for "Work"

In [32]:
plt.figure(figsize=(10,5))
#plt.plot(loadings[0],color='pink',label='First')
plt.plot(loadings[1],color='lightblue',label='Second')
plt.plot(loadings[2],color='red',label='Third')
#plt.plot(loadings[3],color='blue',label='Fourth')
#plt.plot(loadings[4],color='gray',label='Fifth')
#plt.plot(loadings[5],color='white',label='Sixth')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Loadings')
plt.legend()
# 2 suggests that people work more on weekends, while
# 3 suggests that people work more on weekdays
Out[32]:
<matplotlib.legend.Legend at 0x7fbc8c947fd0>

2.2.2. Loadings for "Home"

In [33]:
plt.figure(figsize=(10,5))
#plt.plot(loadings[0],color='pink',label='First')
#plt.plot(loadings[1],color='lightblue',label='Second')
#plt.plot(loadings[2],color='red',label='Third')
plt.plot(loadings[3],color='blue',label='Fourth')
#plt.plot(loadings[4],color='gray',label='Fifth')
plt.plot(loadings[5],color='white',label='Sixth')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Loadings')
plt.legend()
# 4 and 6 have a similar pattern on weekdays
# 4 suggests that on weekends people stay home in the evening but are somewhere else in the morning, while
# 6 suggests that on weekends people are at home in the morning but somewhere else in the evening
Out[33]:
<matplotlib.legend.Legend at 0x7fbc8c90e8d0>

2.3. Explained variances

In [34]:
#Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(x, axis=0)
cov_mat = np.cov(x.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort from high to low
eig_pairs.sort(key = lambda x: x[0], reverse= True)

# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)] # Individual explained variance
cum_var_exp = np.cumsum(var_exp) # Cumulative explained variance
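
As a sanity-check sketch on random data, the eigenvalue route above agrees with sklearn's own explained_variance_ratio_:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# sklearn's ratios...
ratio_sklearn = PCA().fit(X).explained_variance_ratio_

# ...match the covariance eigenvalues, sorted descending and normalized.
eig_vals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
ratio_eig = eig_vals / eig_vals.sum()

agree = np.allclose(ratio_sklearn, ratio_eig)
print(agree)
```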
In [35]:
# PLOT OUT THE EXPLAINED VARIANCES SUPERIMPOSED 
plt.figure(figsize=(16, 8))
plt.bar(range(len(var_exp)), var_exp, alpha=0.3333, align='center', label='individual explained variance', color = 'g')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()

3. K-Means

In [36]:
from sklearn.cluster import KMeans

3.1. Group into four clusters

3.1.1. with 1st PC

In [129]:
num_clusters = 4
kmeans = KMeans(n_clusters = num_clusters,random_state = 8888)

#Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(principalComponents[:,[0,1,2,3,4,5]])

df_pivot_merge_sum['pca_cluster'] = X_clustered
print(df_pivot_merge_sum[['pca_cluster','IDloc']].groupby(['pca_cluster']).count().reset_index())

cluster_0 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==0][interested_cols]
cluster_1 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==1][interested_cols]
cluster_2 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==2][interested_cols]
cluster_3 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==3][interested_cols]
wday_hr  pca_cluster   IDloc
0                  0  287645
1                  1      48
2                  2     711
3                  3     206
In [131]:
#Home hours: before 8 am and after 7 pm, as in most other research
plt.figure(figsize=(10,6))
plt.plot(list(cluster_0.mean()), color='gray', label='Others')
plt.plot(list(cluster_1.mean()), color='orange',label='Home1')
plt.plot(list(cluster_2.mean()), color='green', label='Work?')
plt.plot(list(cluster_3.mean()), color='red', label='Home2?')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Normalized Hourly Presence')
plt.legend()
Out[131]:
<matplotlib.legend.Legend at 0x7fbc7d80c588>

3.1.2. without 1st PC

In [132]:
num_clusters = 4
kmeans = KMeans(n_clusters = num_clusters,random_state = 8888)

#Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(principalComponents[:,[1,2,3,4,5]])

df_pivot_merge_sum['pca_cluster'] = X_clustered
print(df_pivot_merge_sum[['pca_cluster','IDloc']].groupby(['pca_cluster']).count().reset_index())

cluster_0 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==0][interested_cols]
cluster_1 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==1][interested_cols]
cluster_2 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==2][interested_cols]
cluster_3 = df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==3][interested_cols]
wday_hr  pca_cluster   IDloc
0                  0  287982
1                  1     167
2                  2     265
3                  3     196
In [133]:
#Home hours: before 8 am and after 7 pm, as in most other research
plt.figure(figsize=(10,6))
plt.plot(list(cluster_0.mean()), color='gray', label='Others')
plt.plot(list(cluster_1.mean()), color='orange',label='Home1')
plt.plot(list(cluster_2.mean()), color='red', label='Home2')
plt.plot(list(cluster_3.mean()), color='green', label='Work')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Normalized Hourly Presence')
plt.legend()
Out[133]:
<matplotlib.legend.Legend at 0x7fbc7881c908>
In [134]:
df_pivot_merge_sum['id'] = df_pivot_merge_sum.IDloc.map(lambda x: x.split(sep='_')[0])
df_pivot_merge_sum['lat'] = df_pivot_merge_sum.IDloc.map(lambda x: float(x.split(sep='_')[1].split('-')[0]))
df_pivot_merge_sum['long'] = df_pivot_merge_sum.IDloc.map(lambda x:float(x.split(sep='_')[1].split('-')[1]))
In [135]:
sid = list(set(df_pivot_merge_sum[df_pivot_merge_sum.pca_cluster==0].id))
len(sid)
Out[135]:
200

3.2. Plot all

In [86]:
LABEL_COLOR_MAP = {0:['silver',10], 1: ['orange',100], 2: ['red',100], 3: ['green',100]}
for s_id in sid:
    plt.figure(figsize=(10,6))
    for i in range(0,len(set(df_pivot_merge_sum.pca_cluster))):
        sub = df_pivot_merge_sum[(df_pivot_merge_sum.pca_cluster==i) & (df_pivot_merge_sum.id==s_id)]
        plt.scatter(sub['long'], sub['lat'], color=LABEL_COLOR_MAP[i][0], s=LABEL_COLOR_MAP[i][1])
    #plt.title('ID: ' + str(s_id))
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')

4. How about using ICA instead?

PCA finds the set of vectors that best explains the variability of the data, while ICA finds a set of independent vectors that can reconstruct the data.

PCA helps to compress data and ICA helps to separate data.
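
The separation claim can be illustrated on synthetic signals before applying ICA to the presence data. In this hedged sketch (made-up sources and mixing matrix), ICA recovers the original independent signals from their linear mixtures almost perfectly:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(8888)
t = np.linspace(0, 8, 2000)

# Two independent source signals with a little noise...
s1 = np.sin(2 * t)                    # sinusoid
s2 = np.sign(np.sin(3 * t))           # square wave
S = np.c_[s1, s2] + 0.02 * rng.randn(2000, 2)

# ...observed only as linear mixtures.
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T

ica_rec = FastICA(n_components=2, random_state=8888).fit_transform(X)

def best_corr(rec, src):
    # Max absolute correlation of any recovered component with a true source.
    return max(abs(np.corrcoef(rec[:, i], src)[0, 1]) for i in range(2))

print(best_corr(ica_rec, s1))
```

One of the two ICA components should correlate almost perfectly (|r| close to 1) with each source, which is exactly the "separation" property exploited below.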

4.1 Compare PCA vs ICA

In [89]:
from sklearn.decomposition import FastICA, PCA
NUM_COMPONENTS = 4

# Compute ICA
ica = FastICA(n_components=NUM_COMPONENTS)
S_ = ica.fit_transform(x)     # Reconstruct signals
A_ = ica.mixing_.transpose()  # Get estimated mixing matrix

# For comparison, compute PCA
pca = PCA(n_components=NUM_COMPONENTS)
H = pca.fit_transform(x)  # Reconstruct signals based on orthogonal components
EV_ = pca.components_
In [90]:
plt.figure(figsize=(12,12))

models = [A_, EV_]
names = ["ICA","PCA"]
component_colors = ['red', 'steelblue', 'orange',"green","black"]

for ii, (model, name) in enumerate(zip(models, names), 1):
    plt.subplot(2, 1, ii)
    plt.title(name)
    for component, num in zip(model, list(range(NUM_COMPONENTS))):
        plt.plot(component,label = "Comp {0}".format(num))
    plt.legend(fancybox=True, framealpha=0.9, shadow=True, borderpad=1)
    plt.axvline(x=7,linestyle='--',color='gray')
    plt.axvline(x=18,linestyle='--',color='gray')
    plt.axvline(x=24,linestyle='-',color='white')
    plt.axvline(x=31,linestyle='--',color='gray')
    plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Loadings')
plt.show()

4.2. Apply K-Means on ICA components

In [121]:
NUM_COMPONENTS = 8
# Compute ICA
ica = FastICA(n_components=NUM_COMPONENTS, random_state=8888)
S_ = ica.fit_transform(x)     # Estimated independent sources
A_ = ica.mixing_.transpose()  # Estimated mixing matrix (one row per component)
In [122]:
kmeans = KMeans(n_clusters=num_clusters, random_state=8888)
X_clustered = kmeans.fit_predict(S_)
df_pivot_merge_sum['ica_cluster'] = X_clustered
print(df_pivot_merge_sum[['ica_cluster','IDloc']].groupby(['ica_cluster']).count().reset_index())
wday_hr  ica_cluster   IDloc
0                  0  288056
1                  1     145
2                  2     165
3                  3     244
In [123]:
cluster_0 = df_pivot_merge_sum[df_pivot_merge_sum.ica_cluster==0][interested_cols]
cluster_1 = df_pivot_merge_sum[df_pivot_merge_sum.ica_cluster==1][interested_cols]
cluster_2 = df_pivot_merge_sum[df_pivot_merge_sum.ica_cluster==2][interested_cols]
cluster_3 = df_pivot_merge_sum[df_pivot_merge_sum.ica_cluster==3][interested_cols]
In [124]:
# Home hours: before 8 am and after 7 pm, as in most prior research
plt.figure(figsize=(10,6))
plt.plot(list(cluster_0.mean()), color='gray', label='Others')
plt.plot(list(cluster_1.mean()), color='green',label='Work')
plt.plot(list(cluster_2.mean()), color='orange', label='Home1')
plt.plot(list(cluster_3.mean()), color='red', label='Home2')
plt.axvline(x=7,linestyle='--',color='gray')
plt.axvline(x=18,linestyle='--',color='gray')
plt.axvline(x=24,linestyle='-',color='white')
plt.axvline(x=31,linestyle='--',color='gray')
plt.axvline(x=40,linestyle='--',color='gray')
plt.xlabel('Hour')
plt.ylabel('Normalized Hourly Presence')
plt.legend()
Out[124]:
<matplotlib.legend.Legend at 0x7fbc800258d0>

4.2.1. Number of unique IDs for ICA clusters vs PCA clusters

In [136]:
ICA_CLUSTER_NAMES = {0: "Others", 1: "Work", 2: "Home1", 3: "Home2"}
PCA_CLUSTER_NAMES = {0: "Others", 1: "Home1", 2: "Home2", 3: "Work"}

def ica_cluster_name(cluster_number):
    return ICA_CLUSTER_NAMES.get(cluster_number, "None")

def pca_cluster_name(cluster_number):
    return PCA_CLUSTER_NAMES.get(cluster_number, "None")
In [139]:
cnt = df_pivot_merge_sum[['ica_cluster','id']].drop_duplicates()
cnt = cnt.groupby(['ica_cluster']).count().reset_index()
cnt.columns = ['ica_cluster', 'Number of unique IDs']
cnt['cluster_name'] = cnt.ica_cluster.map(lambda x: ica_cluster_name(x))
cnt.sort_values("cluster_name")
Out[139]:
ica_cluster Number of unique IDs cluster_name
2 2 125 Home1
3 3 160 Home2
0 0 200 Others
1 1 115 Work
In [148]:
'# of unique IDs for Home: ',len(set(df_pivot_merge_sum[(df_pivot_merge_sum.ica_cluster==2)|(df_pivot_merge_sum.ica_cluster==3)]['id']))
Out[148]:
('# of unique IDs for Home: ', 182)
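As an aside, the chained `|` conditions used above to count home IDs can be written more compactly with `isin`; a minimal sketch on a made-up frame whose columns mirror `df_pivot_merge_sum`:

```python
import pandas as pd

# Toy frame mimicking df_pivot_merge_sum's id / ica_cluster columns (made-up data)
df = pd.DataFrame({
    'id':          [1, 1, 2, 2, 3, 4],
    'ica_cluster': [0, 2, 3, 0, 1, 2],
})

# Clusters 2 and 3 were labeled Home1/Home2 above
home_ids = set(df.loc[df.ica_cluster.isin([2, 3]), 'id'])
print('# of unique IDs for Home:', len(home_ids))  # 3 here: ids 1, 2, 4
```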
In [144]:
cnt = df_pivot_merge_sum[['pca_cluster','id']].drop_duplicates()
cnt = cnt.groupby(['pca_cluster']).count().reset_index()
cnt.columns = ['pca_cluster', 'Number of unique IDs']
cnt['cluster_name'] = cnt.pca_cluster.map(lambda x: pca_cluster_name(x))
cnt.sort_values("cluster_name")
Out[144]:
pca_cluster Number of unique IDs cluster_name
1 1 121 Home1
2 2 158 Home2
0 0 200 Others
3 3 132 Work
In [147]:
'# of unique IDs for Home: ',len(set(df_pivot_merge_sum[(df_pivot_merge_sum.pca_cluster==1)|(df_pivot_merge_sum.pca_cluster==2)]['id']))
Out[147]:
('# of unique IDs for Home: ', 178)

5. Conclusions

  • PCA and ICA produce similar clusters when PCA's 1st component is excluded from the features.
  • In this data, the Home location pattern is easier to capture than the Work pattern.
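The first conclusion amounts to clustering PCA scores with the 1st component excluded; a minimal sketch on synthetic data (not the notebook's `x`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Stand-in for the NHP feature matrix (48 hourly features, as in the notebook)
x_demo = rng.normal(size=(300, 48))

# Project onto the top 8 principal components
H = PCA(n_components=8).fit_transform(x_demo)

# Drop the 1st (largest-variance) component before clustering
km = KMeans(n_clusters=4, random_state=8888, n_init=10)
labels = km.fit_predict(H[:, 1:])
print(labels.shape)
```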

TO DO:

  • Find a more dynamic/objective way of choosing how many PCA components to keep, or how many ICA components to estimate.
  • For people with multiple home/work location predictions, create a mechanism to select just one.
  • For people without a home/work location prediction, create a mechanism to make a prediction from other clusters, or consider using fuzzy clustering methods.
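For the first TO DO item, one common heuristic (a sketch, not what the notebook currently does) is to keep the smallest number of PCA components whose cumulative explained variance ratio crosses a chosen threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(8888)
# Stand-in for the NHP feature matrix x (48 hourly features in the notebook)
x_demo = rng.normal(size=(500, 48))
x_demo[:, :5] *= 10.0  # give a few dimensions most of the variance

# Fit PCA with all components and accumulate explained variance
pca = PCA().fit(x_demo)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest k with >= 90% of the variance explained
k = int(np.searchsorted(cumvar, 0.90) + 1)
print('components to keep:', k)
```

The 0.90 threshold is an assumption for illustration; cross-validating the downstream clustering against each candidate `k` would be a more objective criterion.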